Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance
نویسندگان
چکیده
Performance on multicore processors is determined largely by on-chip cache. Computer architects have conducted numerous studies in the past that vary core count and cache capacity as well as problem size to understand impact on cache behavior. These studies are very costly due to the combinatorial design spaces they must explore. Reuse distance (RD) analysis can help architects explore multicore cache performance more efficiently. One problem, however, is multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, undermining RD analysis benefits. But for parallel programs with symmetric threads, CRD profiles vary with architecture tractably: they change only slightly with cache capacity scaling, and shift predictably to larger CRD values with core count scaling. This enables analysis of a large number of multicore configurations from a small set of measured CRD profiles. This paper investigates using RD analysis to efficiently analyze multicore cache performance for parallel programs, making several contributions. First, we characterize how CRD profiles change with core count and cache capacity. One of our findings is core count scaling degrades locality, but 1 the degradation only impacts last-level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 128MB if problem size scales by 64x. Second, we apply reference groups [1] to predict CRD profiles across core count scaling, and evaluate prediction accuracy. Finally, we use CRD profiles to analyze multicore cache performance. We find predicted CRD profiles can estimate LLC MPKI within 76% of simulation for configurations without pathologic cache conflicts in 1 1200 th the time to perform simulation of the full design space.
منابع مشابه
Distance Analysis for Large - Scale Chip Multiprocessors
Title of dissertation: REUSE DISTANCE ANALYSIS FOR LARGE-SCALE CHIP MULTIPROCESSORS Meng-Ju Wu, Doctor of Philosophy, 2012 Dissertation directed by: Professor Donald Yeung Department of Electrical and Computer Engineering Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide a parallel program’s detailed memory behavior. Concurrent Reuse Distance (CRD) and Priva...
متن کاملUnderstanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis
Understanding multicore memory behavior is crucial, but can be challenging due to the cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complex thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform simulation to sort out these interactions, but this can be costly ...
متن کاملMemory Distance Measurement for Concurrent Programs
Memory distance analysis, the number of unique memory references made between two accesses to the same memory location, is an effective method to measure data locality and predict memory behavior. Many existing methods on memory distance measurement and analysis consider sequential programs only. With the trend towards concurrent programming, it is necessary to study the impact of memory distan...
متن کاملIs Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?
On Chip Multiprocessors (CMP), it is common that multiple cores share certain levels of cache. The sharing increases the contention in cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served for a variety of locality analysis, program transformations, and performance pred...
متن کاملLoop Fusion in High Performance Fortran Loop Fusion in High Performance Fortran
In this paper we investigate a unique problem associated with fusing loops within a High Performance Fortran (HPF) program. In particular, we discuss the issue of performing loop fusion in an HPF compiler when compiling Fortran90 array assignment statements for execution on a distributed-memory machine. During compilation of an HPF program, Fortran90 array assignment statements must be scalariz...
متن کامل